Abstract

Bayesian model-based reinforcement learning is a formally elegant approach to
learning optimal behaviour under model uncertainty, trading off exploration and
exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal
policies is notoriously taxing, since the search space becomes enormous. In this
paper we introduce a tractable, sample-based method for approximate Bayes-optimal
planning which exploits Monte-Carlo tree search. Our approach outperformed
prior Bayesian model-based RL algorithms by a significant margin on several
well-known benchmark problems, because it avoids expensive applications
of Bayes rule within the search tree by lazily sampling models from the current
beliefs. We illustrate the advantages of our approach by showing it working in
an infinite state space domain which is qualitatively out of reach of almost all
previous work in Bayesian exploration.



1 Introduction 

A key objective in the theory of Markov Decision Processes (MDPs) is to maximize the expected 
sum of discounted rewards when the dynamics of the MDP are (perhaps partially) unknown. The 
discount factor pressures the agent to favor short-term rewards, but potentially costly exploration 
may identify better rewards in the long-term. This conflict leads to the well-known
exploration-exploitation trade-off. One way to solve this dilemma [3, 10] is to augment the regular state of the
agent with the information it has acquired about the dynamics. One formulation of this idea is the 
augmented Bayes-Adaptive MDP (BAMDP) [18, 9], in which the extra information is the posterior 
belief distribution over the dynamics, given the data so far observed. The agent starts in the belief 
state corresponding to its prior and, by executing the greedy policy in the BAMDP whilst updating 
its posterior, acts optimally (with respect to its beliefs) in the original MDP. In this framework, rich 
prior knowledge about statistics of the environment can be naturally incorporated into the planning 
process, potentially leading to more efficient exploration and exploitation of the uncertain world. 

Unfortunately, exact Bayesian reinforcement learning is computationally intractable. Various
algorithms have been devised to approximate optimal learning, but often at rather large cost. Here, we
present a tractable approach that exploits and extends recent advances in Monte-Carlo tree search
(MCTS) [16, 20], but avoiding problems associated with applying MCTS directly to the BAMDP.

At each iteration in our algorithm, a single MDP is sampled from the agent's current beliefs. This 
MDP is used to simulate a single episode whose outcome is used to update the value of each node of 
the search tree traversed during the simulation. By integrating over many simulations, and therefore 
many sample MDPs, the optimal value of each future sequence is obtained with respect to the agent's 
beliefs. We prove that this process converges to the Bayes-optimal policy, given infinite samples. To
increase computational efficiency, we introduce a further innovation: a lazy sampling scheme that 
considerably reduces the cost of sampling. 

We applied our algorithm to a representative sample of benchmark problems and competitive
algorithms from the literature. It consistently and significantly outperformed existing Bayesian RL
methods, and also recent non-Bayesian approaches, thus achieving state-of-the-art performance.

Our algorithm is more efficient than previous sparse sampling methods for Bayes-adaptive planning
[25, 2], partly because it does not update the posterior belief state during the course of each
simulation. It thus avoids repeated applications of Bayes rule, which is expensive for all but the
simplest priors over the MDP. Consequently, our algorithm is particularly well suited to support
planning in domains with richly structured prior knowledge, a critical requirement for applications
of Bayesian reinforcement learning to large problems. We illustrate this benefit by showing that our
algorithm can tackle a domain with an infinite number of states and a structured prior over the
dynamics, a challenging (if not intractable) task for existing approaches.



2 Bayesian RL 

We describe the generic Bayesian formulation of optimal decision-making in an unknown MDP,
following [18] and [9]. An MDP is described as a 5-tuple $M = \langle S, A, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $S$ is the
set of states, $A$ is the set of actions, $\mathcal{P} : S \times A \times S \to \mathbb{R}$ is the state transition probability kernel,
$\mathcal{R} : S \times A \to \mathbb{R}$ is a bounded reward function, and $\gamma$ is the discount factor [23]. When all the
components of the MDP tuple are known, standard MDP planning algorithms can be used to estimate
the optimal value function and policy off-line. In general, the dynamics are unknown, and we assume
that $\mathcal{P}$ is a latent variable distributed according to a distribution $P(\mathcal{P})$. After observing a history
of actions and states $h_t = s_1 a_1 s_2 a_2 \ldots a_{t-1} s_t$ from the MDP, the posterior belief on $\mathcal{P}$ is updated
using Bayes' rule $P(\mathcal{P} \mid h_t) \propto P(h_t \mid \mathcal{P}) P(\mathcal{P})$. The uncertainty about the dynamics of the model can
be transformed into uncertainty about the current state inside an augmented state space $S^+ = S \times \mathcal{H}$,
where $S$ is the state space in the original problem and $\mathcal{H}$ is the set of possible histories. The dynamics
associated with this augmented state space are described by

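$$\mathcal{P}^+(\langle s, h \rangle, a, \langle s', h' \rangle) = \mathbb{1}[h' = has'] \int_{\mathcal{P}} \mathcal{P}(s, a, s')\, P(\mathcal{P} \mid h)\, d\mathcal{P}, \qquad \mathcal{R}^+(\langle s, h \rangle, a) = \mathcal{R}(s, a) \tag{1}$$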

Together, the 5-tuple $M^+ = \langle S^+, A, \mathcal{P}^+, \mathcal{R}^+, \gamma \rangle$ forms the Bayes-Adaptive MDP (BAMDP) for
the MDP problem $M$. Since the dynamics of the BAMDP are known, it can in principle be solved
to obtain the optimal value function associated with each action:



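$$Q^*(\langle s_t, h_t \rangle, a) = \max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\middle|\, a_t = a \right] \tag{2}$$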



from which the optimal action for each state can be readily derived.¹ Optimal actions in the BAMDP
are executed greedily in the real MDP $M$ and constitute the best course of action for a Bayesian
agent with respect to its prior belief over $\mathcal{P}$. It is obvious that the expected performance of the
BAMDP policy in the MDP $M$ is bounded above by that of the optimal policy obtained with a
fully-observable model, with equality occurring, for example, in the degenerate case in which the prior
only has support on the true model.
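To make Equation 1 concrete, the sketch below assumes independent symmetric Dirichlet priors over each transition row, which is one convenient choice and is not required by the formulation above. Under that assumption the integral in Equation 1 reduces to the posterior mean of the transition probabilities, so the BAMDP dynamics can be read off from transition counts. The function name `bamdp_transition` and the flat-list encoding of histories are illustrative and do not come from the paper.

```python
from collections import Counter

def bamdp_transition(history, action, n_states, alpha=1.0):
    """Return [P+(<s,h>, a, <s',has'>) for s' in range(n_states)].

    history is a flat list [s1, a1, s2, a2, ..., st]; the current state is
    its last element. Assumes an independent Dirichlet(alpha) prior on
    every row P(s, a, .), which is an illustrative choice.
    """
    s = history[-1]
    # Count the observed transitions (s_k, a_k, s_{k+1}) in the history.
    counts = Counter(zip(history[0:-2:2], history[1:-1:2], history[2::2]))
    row = [counts[(s, action, s2)] for s2 in range(n_states)]
    total = sum(row) + alpha * n_states
    # Posterior mean of a Dirichlet: (count + alpha) / (total + alpha * |S|).
    return [(n + alpha) / total for n in row]

# The transition 0 -(action 1)-> 1 was observed twice; the agent is back in state 0.
h = [0, 1, 1, 0, 0, 1, 1, 0, 0]
probs = bamdp_transition(h, action=1, n_states=2)
```

With two states and a unit Dirichlet parameter, the two observed transitions shift the predictive probability of reaching state 1 from 0.5 to 0.75.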



3 The BAMCP algorithm 
3.1 Algorithm Description 

The goal of a BAMDP planning method is to find, for each decision point $\langle s, h \rangle$ encountered, the
action $a$ that maximizes Equation 2. Our algorithm, Bayes-adaptive Monte-Carlo Planning (BAMCP),
does this by performing a forward-search in the space of possible future histories of the BAMDP
using a tailored Monte-Carlo tree search.

We employ the UCT algorithm [16] to allocate search effort to promising branches of the state-action
tree, and use sample-based rollouts to provide value estimates at each node. For clarity, let us denote
by Bayes-Adaptive UCT (BA-UCT) the algorithm that applies vanilla UCT to the BAMDP (i.e.,
the particular MDP with dynamics described in Equation 1). Sample-based search in the BAMDP
using BA-UCT requires the generation of samples from $\mathcal{P}^+$ at every single node. This operation
requires integration over all possible transition models, or at least a sample of a transition model $\mathcal{P}$,
which is an expensive procedure for all but the simplest generative models $P(\mathcal{P})$. We avoid this cost by
only sampling a single transition model $\mathcal{P}^i$ from the posterior at the root of the search tree at the
start of each simulation $i$, and using $\mathcal{P}^i$ to generate all the necessary samples during this simulation.
Sample-based tree search then acts as a filter, ensuring that the correct distribution of state successors
is obtained at each of the tree nodes, as if it was sampled from $\mathcal{P}^+$. This root sampling method was
originally introduced in the POMCP algorithm [20], developed to solve Partially Observable MDPs.

¹ The redundancy in the state-history tuple notation ($s_t$ is the suffix of $h_t$) is only present to ensure
clarity of exposition.



3.2 BA-UCT with Root Sampling 

The root node of the search tree at a decision point represents the current state of the BAMDP.
The tree is composed of state nodes representing belief states $\langle s, h \rangle$ and action nodes
representing the effect of particular actions from their parent state node. The visit counts, $N(\langle s, h \rangle)$ for
state nodes and $N(\langle s, h \rangle, a)$ for action nodes, are initialized to 0 and updated throughout search.
A value $Q(\langle s, h \rangle, a)$, initialized to 0, is also maintained for each action node. Each simulation
traverses the tree without backtracking by following the UCT policy at state nodes, defined by
$\operatorname{argmax}_a Q(\langle s, h \rangle, a) + c \sqrt{\log(N(\langle s, h \rangle)) / N(\langle s, h \rangle, a)}$, where $c$ is an exploration constant that needs
to be set appropriately. Given an action, the transition distribution $\mathcal{P}^i$ corresponding to the current
simulation $i$ is used to sample the next state. That is, at action node $(\langle s, h \rangle, a)$, $s'$ is sampled from
$\mathcal{P}^i(s, a, \cdot)$, and the new state node is set to $\langle s', has' \rangle$. When a simulation reaches a leaf, the tree is
expanded by attaching a new state node with its connected action nodes, and a rollout policy $\pi_{ro}$ is
used to control the MDP defined by the current $\mathcal{P}^i$ to some fixed depth (determined using the
discount factor). The rollout provides an estimate of the value $Q(\langle s, h \rangle, a)$ from the leaf action node.
This estimate is then used to update the value of all action nodes traversed during the simulation: if
$R$ is the sampled discounted return obtained from a traversed action node $(\langle s, h \rangle, a)$ in a given
simulation, then we update the value of the action node to
$Q(\langle s, h \rangle, a) + (R - Q(\langle s, h \rangle, a)) / N(\langle s, h \rangle, a)$ (i.e.,
the mean of the sampled returns obtained from that action node over the simulations). A detailed
description of the BAMCP algorithm is provided in Algorithm 1. A diagram example of BAMCP
simulations is presented in Figure S3.
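The procedure described above can be sketched in a few dozen lines. The following is a minimal illustration rather than the authors' implementation: it assumes a small discrete MDP with known rewards `reward[s][a]`, an independent Dirichlet prior on every transition row, a uniform random rollout policy, and a full (non-lazy) model sample at the root. The class name `BAMCP` and all method names are hypothetical, untried actions are given an infinite UCT score so that they are tried first, and the depth bookkeeping is simplified.

```python
import math
import random
from collections import defaultdict

class BAMCP:
    """Minimal BAMCP sketch: known rewards reward[s][a], unknown transitions
    with an independent Dirichlet(alpha) prior per row (both assumptions)."""

    def __init__(self, n_states, n_actions, reward, gamma=0.95, c=3.0,
                 eps=0.01, alpha=1.0, seed=0):
        self.nS, self.nA, self.reward = n_states, n_actions, reward
        self.gamma, self.c, self.eps, self.alpha = gamma, c, eps, alpha
        self.r_max = max(max(row) for row in reward)
        self.rng = random.Random(seed)
        # Real transition counts; they define the posterior used at the root.
        self.counts = defaultdict(lambda: [0] * n_states)

    def observe(self, s, a, s2):
        self.counts[(s, a)][s2] += 1

    def sample_model(self):
        """Root sampling: draw one complete transition model from the posterior."""
        model = {}
        for s in range(self.nS):
            for a in range(self.nA):
                g = [self.rng.gammavariate(self.alpha + n, 1.0)
                     for n in self.counts[(s, a)]]
                z = sum(g)
                model[(s, a)] = [x / z for x in g]
        return model

    def step(self, model, s, a):
        s2 = self.rng.choices(range(self.nS), weights=model[(s, a)])[0]
        return s2, self.reward[s][a]

    def done(self, d):
        # Truncate once the remaining discounted reward is below the precision.
        return self.gamma ** d * self.r_max < self.eps

    def rollout(self, model, s, d):
        if self.done(d):
            return 0.0
        a = self.rng.randrange(self.nA)  # uniform random rollout policy
        s2, r = self.step(model, s, a)
        return r + self.gamma * self.rollout(model, s2, d + 1)

    def simulate(self, model, s, h, d):
        if self.done(d):
            return 0.0
        if self.N[h] == 0:  # new node: expand it and evaluate with a rollout
            a = self.rng.randrange(self.nA)
            s2, r = self.step(model, s, a)
            ret = r + self.gamma * self.rollout(model, s2, d + 1)
            self.N[h], self.Na[(h, a)], self.Q[(h, a)] = 1, 1, ret
            return ret

        def ucb(b):  # UCT score; untried actions are tried first
            n = self.Na[(h, b)]
            if n == 0:
                return math.inf
            return self.Q[(h, b)] + self.c * math.sqrt(math.log(self.N[h]) / n)

        a = max(range(self.nA), key=ucb)
        s2, r = self.step(model, s, a)
        ret = r + self.gamma * self.simulate(model, s2, h + (a, s2), d + 1)
        self.N[h] += 1
        self.Na[(h, a)] += 1
        self.Q[(h, a)] += (ret - self.Q[(h, a)]) / self.Na[(h, a)]  # running mean
        return ret

    def search(self, s, n_simulations):
        # Fresh tree for each decision; nodes are keyed by the simulated history.
        self.N, self.Na, self.Q = defaultdict(int), defaultdict(int), defaultdict(float)
        root = (s,)
        for _ in range(n_simulations):
            self.simulate(self.sample_model(), s, root, 0)
        return max(range(self.nA), key=lambda a: self.Q[(root, a)])

# Action 1 always pays 1 and action 0 pays 0, whatever the dynamics.
planner = BAMCP(2, 2, reward=[[0.0, 1.0], [0.0, 1.0]], gamma=0.5)
best = planner.search(0, n_simulations=50)
```

Note that the posterior is touched only in `sample_model`, once per simulation; inside the tree the sampled model is used as if it were the true MDP, which is the root sampling idea.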



The tree policy treats the forward search as a meta-exploration problem, preferring to exploit
regions of the tree that currently appear better than others while continuing to explore unknown or
less known parts of the tree. This leads to good empirical results even for a small number of
simulations, because effort is expended where search seems fruitful. Nevertheless all parts of the tree
are eventually visited infinitely often, and therefore the algorithm will eventually converge on the
Bayes-optimal policy (see Section 3.5).

Finally, note that the history of transitions $h$ is generally not the most compact sufficient statistic
of the belief in fully observable MDPs. Indeed, it can be replaced with unordered transition counts
$\psi$, considerably reducing the number of states of the BAMDP and, potentially, the complexity of
planning. Given an addressing scheme suitable to the resulting expanding lattice (rather than to a
tree), BAMCP can search in this reduced space. We found this version of BAMCP to offer only a
marginal improvement. This is a common finding for UCT, stemming from its tendency to
concentrate search effort on one of several equivalent paths (up to transposition), implying a limited effect
on performance of reducing the number of those paths.
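The reduction from histories to unordered transition counts can be illustrated with a small sketch. The helper `transition_counts` and the flat-list encoding of histories are hypothetical; the point is only that two different orderings of the same transitions map to the same node key, which is what turns the search tree into a lattice.

```python
from collections import Counter

def transition_counts(history):
    """Collapse a history [s1, a1, s2, ..., st] into a hashable key made of
    the current state and the unordered transition counts."""
    triples = zip(history[0:-2:2], history[1:-1:2], history[2::2])
    counts = Counter(triples)
    return history[-1], frozenset(counts.items())

# Same three transitions, experienced in a different order.
h1 = [0, 'a', 1, 'b', 0, 'a', 0]   # 0-a->1, 1-b->0, 0-a->0
h2 = [0, 'a', 0, 'a', 1, 'b', 0]   # 0-a->0, 0-a->1, 1-b->0
same_node = transition_counts(h1) == transition_counts(h2)
```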



3.3 Lazy Sampling 

In previous work on sample-based tree search, indeed including POMCP [20], a complete sample
state is drawn from the posterior at the root of the search tree. However, this can be computationally
very costly. Instead, we sample $\mathcal{P}$ lazily, creating only the particular transition probabilities that are
required as the simulation traverses the tree, and also during the rollout.

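Consider $\mathcal{P}(s, a, \cdot)$ to be parametrized by a latent variable $\theta_{s,a}$ for each state and action pair. These
may depend on each other, as well as on an additional set of latent variables $\phi$. The posterior over
$\mathcal{P}$ can be written as $P(\Theta \mid h) = \int_{\phi} P(\Theta \mid \phi, h) P(\phi \mid h)$, where $\Theta = \{\theta_{s,a} \mid s \in S, a \in A\}$. Define
$\Theta_t = \{\theta_{s_1,a_1}, \ldots, \theta_{s_t,a_t}\}$ as the (random) set of $\theta$ parameters required during the course of a
BAMCP simulation that starts at time 1 and ends at time $t$. Using the chain rule, we can rewrite

$$P(\Theta \mid \phi, h) = P(\theta_{s_1,a_1} \mid \phi, h)\, P(\theta_{s_2,a_2} \mid \Theta_1, \phi, h) \cdots P(\theta_{s_T,a_T} \mid \Theta_{T-1}, \phi, h)\, P(\Theta \setminus \Theta_T \mid \Theta_T, \phi, h)$$

where $T$ is the length of the simulation and $\Theta \setminus \Theta_T$ denotes the (random) set of parameters that
are not required for a simulation. For each simulation $i$, we sample $P(\phi \mid h_t)$ at the root and then
lazily sample the $\theta_{s_t,a_t}$ parameters as required, conditioned on $\phi$ and all $\Theta_{t-1}$ parameters sampled
for the current simulation. This process is stopped at the end of the simulation, potentially before
all parameters have been sampled. For example, if the transition parameters for different states
and actions are independent, we can completely forgo sampling a complete $\mathcal{P}$, and instead draw any
necessary parameters individually for each state-action pair. This leads to substantial performance
improvement, especially in large MDPs where a single simulation only requires a small subset of
parameters (see for example the domain in Section 5.2).

Algorithm 1: BAMCP

    procedure Search($\langle s, h \rangle$)
        repeat
            $\mathcal{P} \sim P(\mathcal{P} \mid h)$
            Simulate($\langle s, h \rangle$, $\mathcal{P}$, 0)
        until Timeout()
        return $\operatorname{argmax}_a Q(\langle s, h \rangle, a)$
    end procedure

    procedure Rollout($\langle s, h \rangle$, $\mathcal{P}$, $d$)
        if $\gamma^d R_{max} < \epsilon$ then
            return 0
        end
        $a \sim \pi_{ro}(\langle s, h \rangle, \cdot)$
        $s' \sim \mathcal{P}(s, a, \cdot)$
        $r \leftarrow \mathcal{R}(s, a)$
        return $r + \gamma\,$Rollout($\langle s', has' \rangle$, $\mathcal{P}$, $d + 1$)
    end procedure

    procedure Simulate($\langle s, h \rangle$, $\mathcal{P}$, $d$)
        if $\gamma^d R_{max} < \epsilon$ then return 0
        if $N(\langle s, h \rangle) = 0$ then
            for all $a \in A$ do
                $N(\langle s, h \rangle, a) \leftarrow 0$, $Q(\langle s, h \rangle, a) \leftarrow 0$
            end
            $a \sim \pi_{ro}(\langle s, h \rangle, \cdot)$
            $s' \sim \mathcal{P}(s, a, \cdot)$
            $r \leftarrow \mathcal{R}(s, a)$
            $R \leftarrow r + \gamma\,$Rollout($\langle s', has' \rangle$, $\mathcal{P}$, $d$)
            $N(\langle s, h \rangle) \leftarrow 1$, $N(\langle s, h \rangle, a) \leftarrow 1$
            $Q(\langle s, h \rangle, a) \leftarrow R$
            return $R$
        end
        $a \leftarrow \operatorname{argmax}_b Q(\langle s, h \rangle, b) + c \sqrt{\log(N(\langle s, h \rangle)) / N(\langle s, h \rangle, b)}$
        $s' \sim \mathcal{P}(s, a, \cdot)$
        $r \leftarrow \mathcal{R}(s, a)$
        $R \leftarrow r + \gamma\,$Simulate($\langle s', has' \rangle$, $\mathcal{P}$, $d + 1$)
        $N(\langle s, h \rangle) \leftarrow N(\langle s, h \rangle) + 1$
        $N(\langle s, h \rangle, a) \leftarrow N(\langle s, h \rangle, a) + 1$
        $Q(\langle s, h \rangle, a) \leftarrow Q(\langle s, h \rangle, a) + (R - Q(\langle s, h \rangle, a)) / N(\langle s, h \rangle, a)$
        return $R$
    end procedure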
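For the independent case mentioned in the lazy sampling discussion, the idea can be sketched as a model object that samples a transition row only the first time a simulation asks for it and then reuses it for the rest of that simulation. This is an illustration under the assumption of independent Dirichlet rows; the class `LazyDirichletModel` and its methods are hypothetical names, and a fresh instance would be created for every simulation.

```python
import random

class LazyDirichletModel:
    """Transition model whose rows theta_{s,a} are sampled only when first needed.

    counts maps (s, a) to a list of observed next-state counts; missing
    entries are treated as all zeros. Assumes independent Dirichlet(alpha)
    posteriors per row, which is an illustrative choice.
    """

    def __init__(self, counts, n_states, alpha=1.0, rng=None):
        self.counts, self.n_states, self.alpha = counts, n_states, alpha
        self.rng = rng or random.Random()
        self.rows = {}  # rows sampled so far in the current simulation

    def row(self, s, a):
        if (s, a) not in self.rows:
            observed = self.counts.get((s, a), [0] * self.n_states)
            # Dirichlet sample via normalized Gamma draws.
            g = [self.rng.gammavariate(self.alpha + n, 1.0) for n in observed]
            z = sum(g)
            self.rows[(s, a)] = [x / z for x in g]
        return self.rows[(s, a)]

    def sample_next(self, s, a):
        return self.rng.choices(range(self.n_states), weights=self.row(s, a))[0]

# A 1000-state MDP, but a 5-step simulation touches at most 5 rows.
model = LazyDirichletModel(counts={}, n_states=1000, rng=random.Random(0))
state = 0
for _ in range(5):
    state = model.sample_next(state, 0)
```

Caching the sampled rows matters: within one simulation every revisit to the same state-action pair must see the same parameters, otherwise the simulation would no longer correspond to a single sampled MDP.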



3.4 Rollout Policy Learning 

The choice of rollout policy $\pi_{ro}$ is important if simulations are few, especially if the domain does
not display substantial locality or if rewards require a carefully selected sequence of actions to be
obtained. Otherwise, a simple uniform random policy can be chosen to provide noisy estimates.
In this work, we learn $Q_{ro}$, the optimal Q-value in the real MDP, in a model-free manner (e.g.,
using Q-learning) from samples $(s_t, a_t, r_t, s_{t+1})$ obtained off-policy as a result of the interaction
of the Bayesian agent with the environment. Acting greedily according to $Q_{ro}$ translates to pure
exploitation of gathered knowledge. A rollout policy in BAMCP following $Q_{ro}$ could therefore
over-exploit. Instead, similar to [13], we select an $\epsilon$-greedy policy with respect to $Q_{ro}$ as our rollout
policy $\pi_{ro}$. This biases rollouts towards observed regions of high rewards. This method provides
valuable direction for the rollout policy at negligible computational cost. More complex rollout
policies can be considered, for example rollout policies that depend on the sampled model $\mathcal{P}^i$.
However, these usually incur computational overhead.
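A rollout policy of this kind can be sketched with tabular Q-learning. The class name `RolloutPolicy`, the learning rate `lr`, and the tabular representation are assumptions made for illustration; the text only specifies that $Q_{ro}$ is learned model-free from real transitions and that rollouts act epsilon-greedily with respect to it.

```python
import random
from collections import defaultdict

class RolloutPolicy:
    """Epsilon-greedy rollout policy over a Q-function learned by Q-learning."""

    def __init__(self, n_actions, gamma=0.95, lr=0.1, epsilon=0.5, rng=None):
        self.n_actions, self.gamma = n_actions, gamma
        self.lr, self.epsilon = lr, epsilon      # lr is an assumed step size
        self.rng = rng or random.Random()
        self.q = defaultdict(float)              # Q_ro, initialized to 0

    def update(self, s, a, r, s2):
        """Off-policy Q-learning update from one real transition (s, a, r, s2)."""
        best_next = max(self.q[(s2, b)] for b in range(self.n_actions))
        target = r + self.gamma * best_next
        self.q[(s, a)] += self.lr * (target - self.q[(s, a)])

    def act(self, s):
        """Random action with probability epsilon, otherwise greedy on Q_ro."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda b: self.q[(s, b)])

# With epsilon = 0 the policy is purely greedy, which makes the effect visible.
policy = RolloutPolicy(n_actions=2, epsilon=0.0)
policy.update(s=0, a=1, r=1.0, s2=0)
greedy_action = policy.act(0)
```

The `update` method would be called on every real environment step, while `act` is called only inside simulated rollouts, so the added planning cost is a table lookup per rollout step.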



3.5 Theoretical properties 

Define $V(\langle s, h \rangle) = \max_{a \in A} Q(\langle s, h \rangle, a)$ for all $\langle s, h \rangle \in S \times \mathcal{H}$.

Theorem 1. For all $\epsilon > 0$ (the numerical precision, see Algorithm 1) and a suitably chosen
$c$ (e.g. $c > R_{max}/(1 - \gamma)$), from state $\langle s_t, h_t \rangle$, BAMCP constructs a value function at the root
node that converges in probability to an $\epsilon'$-optimal value function,
$V(\langle s_t, h_t \rangle) \xrightarrow{p} V^*_{\epsilon'}(\langle s_t, h_t \rangle)$,
where $\epsilon' = \epsilon/(1 - \gamma)$. Moreover, for large enough $N(\langle s_t, h_t \rangle)$, the bias of $V(\langle s_t, h_t \rangle)$ decreases as
$O(\log(N(\langle s_t, h_t \rangle)) / N(\langle s_t, h_t \rangle))$. (Proof available in supplementary material.)

By definition, Theorem 1 implies that BAMCP converges to the Bayes-optimal solution
asymptotically. We confirmed this result empirically using a variety of Bandit problems, for which the
Bayes-optimal solution can be computed efficiently using Gittins indices (see supplementary
material).
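The numerical precision in Theorem 1 enters Algorithm 1 through the truncation test on the discounted maximum reward. As a small worked illustration (the helper name `truncation_depth` is hypothetical), the depth at which simulations stop, and an elementary geometric-series bound on the return that is ignored beyond that depth, can be computed as follows.

```python
def truncation_depth(gamma, r_max, eps):
    """Smallest depth d such that gamma**d * r_max < eps."""
    d = 0
    while gamma ** d * r_max >= eps:
        d += 1
    return d

def ignored_tail_bound(gamma, r_max, d):
    """Upper bound on the discounted reward collected from depth d onwards:
    sum_{k >= d} gamma**k * r_max = gamma**d * r_max / (1 - gamma)."""
    return gamma ** d * r_max / (1.0 - gamma)

depth = truncation_depth(gamma=0.5, r_max=1.0, eps=0.01)
tail = ignored_tail_bound(gamma=0.5, r_max=1.0, d=depth)
```

Because the test guarantees that the discounted maximum reward at the cut-off depth is below the precision, the ignored tail is always below the precision divided by one minus the discount factor.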

4 Related Work 

In Section 5, we compare BAMCP to a set of existing Bayesian RL algorithms. Given limited
space, we do not provide a comprehensive list of planning algorithms for MDP exploration, but
rather concentrate on related sample-based algorithms for Bayesian RL.

Bayesian DP [22] maintains a posterior distribution over transition models. At each step, a single
model is sampled, and the action that is optimal in that model is executed. The Best Of Sampled Set
(BOSS) algorithm generalizes this idea [1]. BOSS samples a number of models from the posterior
and combines them optimistically. This drives sufficient exploration to achieve finite-sample
performance guarantees. BOSS is quite sensitive to the parameter that governs the sampling criterion,
which is unfortunately difficult to select. Castro and Precup proposed an SBOSS variant, which
provides a more effective adaptive sampling criterion [5]. BOSS algorithms are generally quite robust,
but suffer from over-exploration.

Sparse sampling [15] is a sample-based tree search algorithm. The key idea is to sample successor
nodes from each state, and apply a Bellman backup to update the value of the parent node from the
values of the child nodes. Wang et al. applied sparse sampling to search over belief-state MDPs [25].
The tree is expanded non-uniformly according to the sampled trajectories. At each decision node, a
promising action is selected using Thompson sampling, i.e., sampling an MDP from that
belief-state, solving the MDP and taking the optimal action. At each chance node, a successor belief-state
is sampled from the transition dynamics of the belief-state MDP.

Asmuth and Littman further extended this idea in their BFS3 algorithm [2], an adaptation of Forward
Search Sparse Sampling [24] to belief-MDPs. Although they described their algorithm as
Monte-Carlo tree search, it in fact uses a Bellman backup rather than Monte-Carlo evaluation. Each Bellman
backup updates both lower and upper bounds on the value of each node. Like Wang et al., the tree
is expanded non-uniformly according to the sampled trajectories, albeit using a different method for
action selection. At each decision node, a promising action is selected by maximising the upper
bound on value. At each chance node, observations are selected by maximising the uncertainty
(upper minus lower bound).

Bayesian Exploration Bonus (BEB) solves the posterior mean MDP, but with an additional reward
bonus that depends on visitation counts [17]. Similarly, Sorg et al. propose an algorithm with a
different form of exploration bonus [21]. These algorithms provide performance guarantees after
a polynomial number of steps in the environment. However, behavior in the early steps of
exploration is very sensitive to the precise exploration bonuses, and it turns out to be hard to translate
sophisticated prior knowledge into the form of a bonus.



Table 1: Experiment results summary. For each algorithm, we report the mean sum of rewards and confidence
interval for the best performing parameter within a reasonable planning time limit (0.25 s/step for Double-loop,
1 s/step for Grid5 and Grid10, 1.5 s/step for the Maze). For BAMCP, this simply corresponds to the number
of simulations that achieve a planning time just under the imposed limit. * Results reported from [22] without
timing information.

                        Double-loop    Grid5       Grid10      Dearden's Maze
    BAMCP               387.6 ± 1.5    72.9 ± 3    32.7 ± 3    965.2 ± 73
    BFS3                382.2 ± 1.5    66 ± 5      10.4 ± 2    240.9 ± 46
    SBOSS               371.5 ± 3      59.3 ± 4    21.8 ± 2    671.3 ± 126
    BEB [17]            386 ± 0        67.5 ± 3    10 ± 1      184.6 ± 35
    Bayesian DP* [22]   377 ± 1        -           -           -
    Bayes VPI+MIX* [8]  326 ± 31       -           -           817.6 ± 29
    IEQL+* [19]         264 ± 1        -           -           269.4 ± 1
    QL Boltzmann*       186 ± 1        -           -           195.2 ± 20

5 Experiments

We first present empirical results of BAMCP on a set of standard problems with comparisons to 
other popular algorithms. Then we showcase BAMCP's advantages in a large scale task: an infinite 
2D grid with complex correlations between reward locations. 

5.1 Standard Domains 
Algorithms 

The following algorithms were run:

BAMCP: the algorithm presented in Section 3, implemented with lazy sampling. The algorithm was run
for different numbers of simulations (10 to 10000) to span different planning times. In all experiments,
we set $\pi_{ro}$ to be an $\epsilon$-greedy policy with $\epsilon = 0.5$. The UCT exploration constant was left
unchanged for all experiments ($c = 3$); we experimented with other values of $c \in \{0.5, 1, 5\}$ with similar results.

SBOSS [5]: for each domain, we varied the number of samples $K \in \{2, 4, 8, 16, 32\}$ and the
resampling threshold parameter $\delta \in \{3, 5, 7\}$.

BEB [17]: for each domain, we varied the bonus parameter $\beta \in \{0.5, 1, 1.5, 2, 2.5, 3, 5, 10, 15, 20\}$.

BFS3 [2]: for each domain, we varied the branching factor $C \in \{2, 5, 10, 15\}$ and the number of
simulations (10 to 2000). The depth of search was set to 15 in all domains except for the larger grid
and maze domain where it was set to 50. We also tuned the $V_{max}$ parameter for each domain; $V_{min}$
was always set to 0.

In addition, we report results from [22] for several other prior algorithms.

Domains 

For all domains, we fix $\gamma = 0.95$. The Double-loop domain is a 9-state deterministic MDP with 2
actions [8]; 1000 steps are executed in this domain. Grid5 is a 5 × 5 grid with no reward anywhere
except for a reward state opposite to the reset state. Actions with cardinal directions are executed
with small probability of failure for 1000 steps. Grid10 is a 10 × 10 grid designed like Grid5. We
collect 2000 steps in this domain. Dearden's Maze is a 264-state maze with 3 flags to collect [8].
A special reward state gives the number of flags collected since the last visit as reward; 20000 steps
are executed in this domain.

To quantify the performance of each algorithm, we measured the total undiscounted reward over
many steps. We chose this measure of performance to enable fair comparisons to be drawn with
prior work. In fact, we are optimising a different criterion (the discounted reward from the start
state), and so we might expect this evaluation to be unfavourable to our algorithm.
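The difference between the two criteria can be made concrete with a short sketch; the function names are illustrative. A reward collected late in an episode counts fully towards the undiscounted total used for evaluation but only weakly towards the discounted return that the planner optimises.

```python
def undiscounted_return(rewards):
    """Evaluation measure used in the experiments: plain sum of rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Planning criterion: rewards weighted by gamma**t from the start state."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

early = [1, 0, 0]
late = [0, 0, 1]
totals = (undiscounted_return(early), undiscounted_return(late))
discounted = (discounted_return(early, 0.5), discounted_return(late, 0.5))
```

Both reward sequences score the same under the evaluation measure, while the discounted criterion prefers the early reward by a factor of four in this example.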

One major advantage of Bayesian RL is that one can specify priors about the dynamics. For the
Double-loop domain, the Bayesian RL algorithms were run with a simple Dirichlet-Multinomial
model with a symmetric Dirichlet parameter $\alpha$. For the grids and the maze domain, the
algorithms were run with a sparse Dirichlet-Multinomial model, as described in [11]. For both of these
models, efficient collapsed sampling schemes are available; they are employed for the BA-UCT and
BFS3 algorithms in our experiments to compress the posterior parameter sampling and the transition
sampling into a single transition sampling step. This considerably reduces the cost of belief updates
inside the search tree when using these simple probabilistic models. In general, efficient collapsed
sampling schemes are not available (see for example the model in Section 5.2).
Results 

A summary of the results is presented in Table 1. Figure 1 reports the planning time/performance
trade-off for the different algorithms on the Grid5 and Maze domain.

On all the domains tested, BAMCP performed best. Other algorithms came close on some tasks, 
but only when their parameters were tuned to that specific domain. This is particularly evident for 
BEB, which required a different value of exploration bonus to achieve maximum performance in 
each domain. BAMCP's performance is stable with respect to the choice of its exploration constant 
c and it did not require tuning to obtain the results. 

BAMCP's performance scales well as a function of planning time, as is evident in Figure 1. In
contrast, SBOSS follows the opposite trend. If more samples are employed to build the merged model,
SBOSS actually becomes too optimistic and over-explores, degrading its performance. BEB cannot
take advantage of prolonged planning time at all. BFS3 generally scales up with more planning
time with an appropriate choice of parameters, but it is not obvious how to trade off the branching
factor, depth, and number of simulations in each domain. BAMCP greatly benefited from our lazy
sampling scheme in the experiments, providing a 35× speed improvement over the naive approach in
the maze domain, for example; this is illustrated in Figure 1(c).

² The result reported for Dearden's maze with the Bayesian DP algorithm in [22] is for a different version of the
task in which the maze layout is given to the agent.

Figure 1: Performance of each algorithm on the Grid5 (a) and Maze domain (b-d) as a function of planning
time. Each point corresponds to a single run of an algorithm with an associated setting of the parameters.
Increasing brightness inside the points codes for an increasing value of a parameter (BAMCP and BFS3: number
of simulations, BEB: bonus parameter $\beta$, SBOSS: number of samples $K$). A second dimension of variation
is coded as the size of the points (BFS3: branching factor $C$, SBOSS: resampling parameter $\delta$). The range of
parameters is specified in Section 5.1. a. Performance of each algorithm on the Grid5 domain. b. Performance
of each algorithm on the Maze domain. c. On the Maze domain, performance of vanilla BA-UCT with and
without rollout policy learning (RL). d. On the Maze domain, performance of BAMCP with and without the
lazy sampling (LS) and rollout policy learning (RL) presented in Sections 3.3 and 3.4. Root sampling (RS) is
included.

Dearden's maze aptly illustrates a major drawback of forward search sparse sampling algorithms
such as BFS3. Like many maze problems, all rewards are zero for at least $k$ steps, where $k$ is the
solution length. Without prior knowledge of the optimal solution length, all upper bounds will be
higher than the true optimal value until the tree has been fully expanded up to depth $k$, even if a
simulation happens to solve the maze. In contrast, once BAMCP discovers a successful simulation,
its Monte-Carlo evaluation will immediately bias the search tree towards the successful trajectory.

5.2 Infinite 2D grid task 

We also applied BAMCP to a much larger problem. The generative model for this infinite-grid
MDP is as follows: each column $i$ has an associated latent parameter $p_i \sim \mathrm{Beta}(\alpha_1, \beta_1)$ and each
row $j$ has an associated latent parameter $q_j \sim \mathrm{Beta}(\alpha_2, \beta_2)$. The probability of grid cell $ij$ having
a reward of 1 is $p_i q_j$, otherwise the reward is 0. The agent knows it is on a grid and is always free
to move in any of the four cardinal directions. Rewards are consumed when visited; returning to the
same location subsequently results in a reward of 0. As opposed to the independent Dirichlet priors
employed in standard domains, here, dynamics are tightly correlated across states (i.e., observing
a state transition provides information about other state transitions). Posterior inference (of the
dynamics $\mathcal{P}$) in this model requires approximation because of the non-conjugate coupling of the
variables; the inference is done via MCMC (details in Supplementary). The domain is illustrated in
Figure S4.

Figure 2 (two panels, each with planning time in seconds on a logarithmic horizontal axis): Performance of
BAMCP as a function of planning time on the Infinite 2D grid task of Section 5.2
for $\gamma = 0.97$, where the grids are generated with Beta parameters $\alpha_1 = 1$, $\beta_1 = 2$, $\alpha_2 = 2$, $\beta_2 = 1$ (see
supp. Figure S4 for a visualization). The performance during the first 200 steps in the environment is averaged
over 50 sampled environments (5 runs for each sample) and is reported both in terms of undiscounted (left) and
discounted (right) sum of rewards. BAMCP is run either with the correct generative model as prior or with an
incorrect prior (parameters for rows and columns are swapped); it is clear that BAMCP can take advantage of
correct prior information to gain more rewards. The performance of a uniform random policy is also reported.



Planning algorithms that attempt to solve an MDP based on sample(s) (or the mean) of the posterior 
(e.g., BOSS, BEB, Bayesian DP) cannot directly handle the large state space. Prior forward-search 
methods (e.g., BA-UCT, BFS3) can deal with the state space, but not the large belief space: at every 
node of the search tree they must solve an approximate inference problem to estimate the posterior 
beliefs. In contrast, BAMCP limits the posterior inference to the root of the search tree and is not 
directly affected by the size of the state space or belief space, which allows the algorithm to perform 
well even with a limited planning time. Note that lazy sampling is required in this setup since a full 
sample of the dynamics involves infinitely many parameters. 
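The need for lazy sampling in this domain can be seen from a sketch of the generative model itself. The code below samples from the prior only (it does not attempt the MCMC posterior inference mentioned in the text), and it instantiates column parameters, row parameters and cell rewards on demand, since the grid is infinite. The class `InfiniteGrid`, the move encoding and the choice to leave the start cell unconsumed are illustrative assumptions; the default Beta parameters follow the values given in the caption of Figure 2.

```python
import random

MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

class InfiniteGrid:
    """Infinite 2D grid whose latent parameters are sampled only when needed."""

    def __init__(self, a1=1.0, b1=2.0, a2=2.0, b2=1.0, seed=0):
        self.rng = random.Random(seed)
        self.col_prior, self.row_prior = (a1, b1), (a2, b2)
        self.p, self.q = {}, {}   # lazily sampled column / row parameters
        self.rewards = {}         # lazily sampled cell rewards
        self.visited = set()      # cells whose reward has been consumed

    def _col(self, i):
        if i not in self.p:
            self.p[i] = self.rng.betavariate(*self.col_prior)
        return self.p[i]

    def _row(self, j):
        if j not in self.q:
            self.q[j] = self.rng.betavariate(*self.row_prior)
        return self.q[j]

    def reward_at(self, i, j):
        # Cell (i, j) holds a reward of 1 with probability p_i * q_j.
        if (i, j) not in self.rewards:
            prob = self._col(i) * self._row(j)
            self.rewards[(i, j)] = 1 if self.rng.random() < prob else 0
        return self.rewards[(i, j)]

    def step(self, pos, action):
        di, dj = MOVES[action]
        new = (pos[0] + di, pos[1] + dj)
        r = 0 if new in self.visited else self.reward_at(*new)
        self.visited.add(new)     # rewards are consumed when visited
        return new, r

grid = InfiniteGrid(seed=1)
pos, r_first = grid.step((0, 0), 'E')
pos, _ = grid.step(pos, 'W')
pos, r_again = grid.step(pos, 'E')
```

After three moves only two column parameters and one row parameter exist, which is the behaviour a full sample of the dynamics could never achieve on an infinite grid.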

Figure 2 (and Figure S5) demonstrates the planning performance of BAMCP in this complex
domain. Performance improves with additional planning time, and the quality of the prior clearly
affects the agent's performance. Supplementary videos contrast the behavior of the agent for
different prior parameters.

6 Future Work 

The UCT algorithm is known to have several drawbacks. First, there are no finite-time regret bounds.
It is possible to construct malicious environments, for example in which the optimal policy is hidden
in a generally low reward region of the tree, where UCT can be misled for long periods [6]. Second,
the UCT algorithm treats every action node as a multi-armed bandit problem. However, there is
no actual benefit to accruing reward during planning, and so it is in theory more appropriate to use
pure exploration bandits [4]. Nevertheless, the UCT algorithm has produced excellent empirical
performance in many domains [12].

BAMCP is able to exploit prior knowledge about the dynamics in a principled manner. In principle, 
it is possible to encode many aspects of domain knowledge into the prior distribution. An important 
avenue for future work is to explore rich, structured priors about the dynamics of the MDP. If this 
prior knowledge matches the class of environments that the agent will encounter, then exploration 
could be significantly accelerated. 

7 Conclusion 

We suggested a sample-based algorithm for Bayesian RL called BAMCP that significantly surpassed 
the performance of existing algorithms on several standard tasks. We showed that BAMCP can 
tackle larger and more complex tasks generated from a structured prior, where existing approaches 
scale poorly. In addition, BAMCP provably converges to the Bayes-optimal solution.

The main idea is to employ Monte-Carlo tree search to explore the augmented Bayes-adaptive search
space efficiently. The naive implementation of that idea is the proposed BA-UCT algorithm, which
cannot scale for most priors due to expensive belief updates inside the search tree. We introduced
three modifications to obtain a computationally tractable sample-based algorithm: root sampling,
which only requires beliefs to be sampled at the start of each simulation (as in [20]); a model-free
RL algorithm that learns a rollout policy; and the use of a lazy sampling scheme to sample the
posterior beliefs cheaply.

Supplementary Material



Proof of Theorem 1 and comments

Consider the BA-UCT algorithm: UCT applied to the Bayes-Adaptive MDP (dynamics are described in 
Equation 1). Let $\mathcal{D}^\pi(h_T)$ be the rollout distribution of BA-UCT: the probability that 
history $h_T$ is generated when running the BA-UCT search from $\langle s_t, h_t \rangle$, with $h_t$ 
a prefix of $h_T$, $T - t$ the effective horizon in the search tree, and $\pi$ an arbitrary BAMDP 
policy. Similarly define the quantity $\tilde{\mathcal{D}}^\pi(h_T)$: the probability that history 
$h_T$ is generated when running the BAMCP algorithm. The following lemma shows that these two 
quantities are in fact equivalent.³ 

Lemma 1. $\mathcal{D}^\pi(h_T) = \tilde{\mathcal{D}}^\pi(h_T)$ for all BAMDP policies $\pi : \mathcal{H} \to A$. 

Proof. Let $\pi$ be arbitrary. We show by induction that for all suffix histories $h$ of $h_t$, 
$\mathcal{D}^\pi(h) = \tilde{\mathcal{D}}^\pi(h)$, and also that $P(\mathcal{P} \mid h) = \tilde{P}_h(\mathcal{P})$, 
where $P(\mathcal{P} \mid h)$ denotes (as before) the posterior distribution over the dynamics given $h$ 
and $\tilde{P}_h(\mathcal{P})$ denotes the distribution of $\mathcal{P}$ at node $h$ when running BAMCP. 

Base case: At the root ($h = h_t$, suffix history of size 0), it is clear that 
$\tilde{P}_{h_t}(\mathcal{P}) = P(\mathcal{P} \mid h_t)$ since we are sampling from the posterior at the 
root node, and $\mathcal{D}^\pi(h_t) = \tilde{\mathcal{D}}^\pi(h_t) = 1$ since all simulations go through 
the root node. 

Step case: 

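Assume the proposition is true for all suffixes of size $i$. Consider any suffix $has'$ of size 
$i + 1$, where $a \in A$ and $s' \in S$ are arbitrary and $h$ is an arbitrary suffix of size $i$ 
ending in $s$. The following relation holds: 

\begin{align}
\mathcal{D}^\pi(has') &= \mathcal{D}^\pi(h)\, \pi(h, a) \int_{\mathcal{P}} d\mathcal{P}\, P(\mathcal{P} \mid h)\, \mathcal{P}(s, a, s') \tag{3} \\
&= \tilde{\mathcal{D}}^\pi(h)\, \pi(h, a) \int_{\mathcal{P}} d\mathcal{P}\, \tilde{P}_h(\mathcal{P})\, \mathcal{P}(s, a, s') \tag{4} \\
&= \tilde{\mathcal{D}}^\pi(has'), \tag{5}
\end{align}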

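where the second line is obtained using the induction hypothesis, and the rest from the definitions. 
In addition, we can match the distribution of the samples $\mathcal{P}$ at node $has'$: 

\begin{align}
P(\mathcal{P} \mid has') &= P(has' \mid \mathcal{P})\, P(\mathcal{P}) / P(has') \tag{6} \\
&= P(h \mid \mathcal{P})\, P(\mathcal{P})\, \mathcal{P}(s, a, s') / P(has') \tag{7} \\
&= P(\mathcal{P} \mid h)\, P(h)\, \mathcal{P}(s, a, s') / P(has') \tag{8} \\
&= Z\, P(\mathcal{P} \mid h)\, \mathcal{P}(s, a, s') \tag{9} \\
&= Z\, \tilde{P}_h(\mathcal{P})\, \mathcal{P}(s, a, s') \tag{10} \\
&= Z\, \tilde{P}_{ha}(\mathcal{P})\, \mathcal{P}(s, a, s') \tag{11} \\
&= \tilde{P}_{has'}(\mathcal{P}), \tag{12}
\end{align}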



where Equation 10 is obtained from the induction hypothesis, and Equation 11 is obtained from the 
fact that the choice of action at each node is made independently of the samples $\mathcal{P}$. 
Finally, to obtain Equation 12 from Equation 11, consider the probability that a sample $\mathcal{P}$ 
arrives at node $has'$: it first needs to traverse node $ha$ (this occurs with probability 
$\tilde{P}_{ha}(\mathcal{P})$) and then, from node $ha$, the state $s'$ needs to be sampled (this 
occurs with probability $\mathcal{P}(s, a, s')$); therefore, 
$\tilde{P}_{has'}(\mathcal{P}) \propto \tilde{P}_{ha}(\mathcal{P})\, \mathcal{P}(s, a, s')$. $Z$ is the 
normalization constant: 
$Z = 1 / \int_{\mathcal{P}} d\mathcal{P}\, \mathcal{P}(s, a, s')\, P(\mathcal{P} \mid h) = 1 / \int_{\mathcal{P}} d\mathcal{P}\, \mathcal{P}(s, a, s')\, \tilde{P}_h(\mathcal{P})$. 
This completes the induction. □ 

Proof of Theorem 1. The UCT analysis in Kocsis and Szepesvári [16] applies to the BA-UCT algorithm, 
since it is vanilla UCT applied to the BAMDP (a particular MDP). By Lemma 1, BAMCP simulations are 
equivalent in distribution to BA-UCT simulations. The nodes in BAMCP are therefore being evaluated 
as in BA-UCT, providing the result. □ 

³ For ease of notation, we refer to a node with its history as opposed to its state and history as 
done in the rest of the paper. 

Lemma 1 provides some intuition for why belief updates are unnecessary in the search tree: the 
search tree filters the samples from the root node so that the distribution of samples at each node 
is equivalent to the distribution obtained when explicitly updating the belief. In particular, the root 
sampling in POMCP [20] and BAMCP is different from evaluating the tree using the posterior mean. 
This is illustrated empirically in the section below in the case of simple bandit problems. 
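The filtering argument can be checked numerically on a toy problem. The following is a minimal sketch, not part of the original experiments: it assumes a belief over only two candidate models and a single transition, and the names `root_sampling_filter` and `bayes_posterior` are hypothetical. Models are sampled once at the root, a successor is simulated from the sampled model, and the empirical distribution of models that reach the child node is compared with the explicit Bayes update.

```python
import random
from collections import Counter


def root_sampling_filter(prior, next_state_prob, n_sims=100_000, seed=0):
    """Empirical distribution of root-sampled models that reach a child node.

    prior: dict mapping model name -> belief probability at the root.
    next_state_prob: dict mapping model name -> probability, under that
        model, of the transition (s, a, s') that leads to the child node.
    """
    rng = random.Random(seed)
    models = list(prior)
    weights = [prior[m] for m in models]
    reached = Counter()
    for _ in range(n_sims):
        model = rng.choices(models, weights)[0]      # sample a model at the root only
        if rng.random() < next_state_prob[model]:    # simulate s' with the sampled model
            reached[model] += 1                      # this simulation passes through the child
    total = sum(reached.values())
    return {m: reached[m] / total for m in models}


def bayes_posterior(prior, next_state_prob):
    """Explicit belief update after observing the transition (s, a, s')."""
    z = sum(prior[m] * next_state_prob[m] for m in prior)
    return {m: prior[m] * next_state_prob[m] / z for m in prior}


# Illustrative numbers (assumptions): two equally likely models.
prior = {"A": 0.5, "B": 0.5}
likelihood = {"A": 0.9, "B": 0.3}
filtered = root_sampling_filter(prior, likelihood)
exact = bayes_posterior(prior, likelihood)
```

Under these assumed numbers the explicit posterior puts probability 0.45 / 0.60 = 0.75 on model A, and the filtered estimate should agree up to Monte-Carlo error, without any belief update being performed inside the simulation.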



BAMCP versus Gittins indices 



[Figure S1 panels: BAMCP with 5000, 250000, 2500000, and 5000000 simulations, and the posterior 
mean decision; horizontal axis: $\alpha$ (ticks 5 to 20).] 


Figure S1: Evaluation of BAMCP against the Bayes-optimal policy, for the case $\gamma = 0.95$, when 
choosing between a deterministic arm with reward 0.5 and a stochastic arm with reward 1 with 
posterior probability $p \sim \mathrm{Beta}(\alpha, \beta)$. The result is tabulated for a range of 
values of $\alpha, \beta$; each cell value corresponds to the probability of making the correct 
decision (computed over 50 runs) when compared to the Gittins indices [14] for the corresponding 
posterior. The first four tables correspond to different numbers of simulations for BAMCP and the 
last table shows the performance when acting according to the posterior mean. In this range of 
$\alpha, \beta$ values, the Gittins indices for the stochastic arm are larger than 0.5 (i.e., 
selecting the stochastic arm is optimal) for $\beta \le \alpha + 1$ but also $\beta = \alpha + 2$ for 
$\alpha > 6$. Acting according to the posterior mean differs from the Bayes-optimal decision when 
$\beta \ge \alpha$ and the Gittins index is larger than 0.5. BAMCP is guaranteed to converge to the 
Bayes-optimal decision in all cases, but convergence is slow for the edge cases where the Gittins 
index is close to 0.5 (e.g., for $\alpha = 17$, $\beta = 19$, the Gittins index is 0.5044, which 
implies a value of $0.5044 / (1 - \gamma) = 10.088$ for the stochastic arm versus a value of 
$0.5 + \gamma \times 10.088 = 10.0836$ for the deterministic arm). 
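The arithmetic in the edge case of the caption can be reproduced in a few lines. This is only a check of the quoted numbers: the Gittins index 0.5044 is taken from the caption rather than computed, and the variable names are illustrative.

```python
gamma = 0.95
alpha, beta = 17, 19
gittins_index = 0.5044  # quoted in the caption, not recomputed here

# Values of the two arms as given in the caption.
value_stochastic = gittins_index / (1 - gamma)
value_deterministic = 0.5 + gamma * value_stochastic
gap = value_stochastic - value_deterministic

# Posterior mean success probability of the stochastic arm.
posterior_mean = alpha / (alpha + beta)
```

The gap between the two values is about 0.0044 on a scale of roughly 10, which is consistent with the slow convergence reported for this cell. The posterior mean 17/36 is below 0.5, so the posterior mean decision picks the deterministic arm here while the Gittins index favours the stochastic arm.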
[Figure S2, panels (a) and (b): comparison of Bayes-optimal, BAMCP, and Posterior Mean.] 


Figure S2: Performance comparison of BAMCP (50000 simulations, 100 runs) against the posterior mean 
decision on an 8-armed Bernoulli bandit with $\gamma = 0.99$ after 300 steps. The arms' success 
probabilities are all 0.6 except for one arm which has success probability 0.9. The Bayes-optimal 
result is obtained from 1000 runs with the Gittins indices [14]. (a) Mean sum of rewards after 300 
steps. (b) Mean sum of discounted rewards after 300 steps. 
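For concreteness, the posterior mean baseline on this bandit can be sketched as follows. This is an illustrative reading only: the prior is not stated in the caption, so an independent Beta(1, 1) prior per arm is assumed, ties are broken uniformly at random, and the function name `run_posterior_mean_agent` is hypothetical.

```python
import random


def run_posterior_mean_agent(probs, steps=300, gamma=0.99, seed=0):
    """Greedy agent that always pulls the arm with the highest posterior mean.

    Returns (sum of rewards, sum of discounted rewards) over `steps` pulls.
    """
    rng = random.Random(seed)
    # Assumption: independent Beta(1, 1) prior on each arm's success probability.
    successes = [1.0] * len(probs)
    failures = [1.0] * len(probs)
    total, discounted = 0.0, 0.0
    for t in range(steps):
        means = [s / (s + f) for s, f in zip(successes, failures)]
        best = max(means)
        arm = rng.choice([i for i, m in enumerate(means) if m == best])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        successes[arm] += reward          # conjugate Beta-Bernoulli update
        failures[arm] += 1.0 - reward
        total += reward
        discounted += (gamma ** t) * reward
    return total, discounted


# The bandit of Figure S2: seven arms at 0.6 and one arm at 0.9.
arm_probs = [0.6] * 7 + [0.9]
total_reward, discounted_reward = run_posterior_mean_agent(arm_probs)
```

Because this agent never explores deliberately, it can lock onto a 0.6 arm whose posterior mean happens to stay above the others, which is the behaviour the comparison with BAMCP is meant to expose.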



Inference details for the infinite 2D grid task of Section 5.2 

We construct a Markov Chain using the Metropolis-Hastings algorithm to sample from the posterior 
distribution of row and column parameters given observed transitions, following the notation 
introduced in Section 5.2. Let $O = \{(i, j)\}$ be the set of observed reward locations, each 
associated with an observed reward $r_{ij} \in \{0, 1\}$. The proposal distribution chooses a 
row-column pair $(i_p, j_p)$ from $O$ uniformly at random, and samples 
$\tilde{p}_{i_p} \sim \mathrm{Beta}(\alpha_1 + m_1, \beta_1 + n_1)$ and 
$\tilde{q}_{j_p} \sim \mathrm{Beta}(\alpha_2 + m_2, \beta_2 + n_2)$, where 
$m_1 = \sum_{(i,j) \in O} \mathbb{1}[i = i_p]\, r_{ij}$ (i.e., the sum of rewards observed on that 
column) and 
$n_1 = \left(1 - \beta_2 / (2(\alpha_2 + \beta_2))\right) \sum_{(i,j) \in O} \mathbb{1}[i = i_p]\,(1 - r_{ij})$, 
and similarly for $m_2, n_2$ (mutatis mutandis). The $n_1$ term for the proposed column parameter 
$\tilde{p}_i$ has this rough correction term, based on the prior mean failure of the row parameters, 
to account for observed rewards on the column due to potentially low row parameters. Since the 
proposal is biased with respect to the true conditional distribution (from which we cannot sample), 
we also prevent the proposal distribution from getting too peaked. Better proposals (e.g., taking 
into account the sampled row parameters) could be devised, but they would likely introduce 
additional computational cost and the proposal above generated large enough acceptance 
probabilities (generally above 0.5 for our experiments). 
All other parameters $p_i, q_j$ such that $i$ or $j$ is present in $O$ are kept from the last 
accepted samples (i.e., $\tilde{p}_i = p_i$ and $\tilde{q}_j = q_j$ for these $i$s and $j$s), and all 
parameters $p_i, q_j$ that are not linked to observations are (lazily) resampled from the prior; 
they do not influence the acceptance probability. We denote by $Q(p, q \to \tilde{p}, \tilde{q})$ the 
probability of proposing the set of parameters $\tilde{p}$ and $\tilde{q}$ from the last accepted 
sample of column/row parameters $p$ and $q$. The acceptance probability $A$ can then be computed as 
$A = \min(1, A')$ where: 



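\begin{align*}
A' &= \frac{P(\tilde{p}, \tilde{q} \mid h)\, Q(\tilde{p}, \tilde{q} \to p, q)}
           {P(p, q \mid h)\, Q(p, q \to \tilde{p}, \tilde{q})} \\
   &= \frac{P(\tilde{p}, \tilde{q})\, Q(\tilde{p}, \tilde{q} \to p, q)\, P(h \mid \tilde{p}, \tilde{q})}
           {P(p, q)\, Q(p, q \to \tilde{p}, \tilde{q})\, P(h \mid p, q)} \\
   &= \frac{p_{i_p}^{m_1} (1 - p_{i_p})^{n_1}\, q_{j_p}^{m_2} (1 - q_{j_p})^{n_2}
            \prod_{(i,j) \in O:\ i = i_p \text{ or } j = j_p}
            (\tilde{p}_i \tilde{q}_j)^{r_{ij}} (1 - \tilde{p}_i \tilde{q}_j)^{1 - r_{ij}}}
           {\tilde{p}_{i_p}^{m_1} (1 - \tilde{p}_{i_p})^{n_1}\, \tilde{q}_{j_p}^{m_2} (1 - \tilde{q}_{j_p})^{n_2}
            \prod_{(i,j) \in O:\ i = i_p \text{ or } j = j_p}
            (p_i q_j)^{r_{ij}} (1 - p_i q_j)^{1 - r_{ij}}}
\end{align*}
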

The last accepted sample is employed whenever a sample is rejected. Finally, reward values $R_{ij}$ 
are resampled lazily based on the last accepted sample of the parameters $p_i, q_j$, when they have 
not been observed already. We omit the implicit deterministic mapping to obtain the dynamics 
$\mathcal{P}$ from these parameters. 
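One step of a sampler of this kind can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: rewards are assumed Bernoulli with success probability $p_i q_j$, the "mutatis mutandis" definition of $n_2$ is read as using $\beta_1 / (2(\alpha_1 + \beta_1))$, the safeguard against overly peaked proposals and the lazy resampling of unobserved parameters are omitted, and the helper names (`mh_step`, `_log_lik`, `_clip`) are hypothetical. The acceptance ratio uses the closed form in which the prior and proposal terms cancel down to the $m$ and $n$ exponents.

```python
import math
import random


def _clip(x, eps=1e-9):
    """Keep a probability strictly inside (0, 1) so that logs stay finite."""
    return min(max(x, eps), 1.0 - eps)


def _log_lik(p, q, obs, i_p, j_p):
    """Log-likelihood of the observations on column i_p or row j_p."""
    total = 0.0
    for (i, j), r in obs.items():
        if i == i_p or j == j_p:
            pq = p[i] * q[j]
            total += math.log(pq) if r == 1 else math.log(1.0 - pq)
    return total


def mh_step(p, q, obs, a1, b1, a2, b2, rng):
    """One Metropolis-Hastings step; returns (p, q, accepted)."""
    i_p, j_p = rng.choice(sorted(obs))             # uniform row-column pair from O
    col = [r for (i, _), r in obs.items() if i == i_p]
    row = [r for (_, j), r in obs.items() if j == j_p]
    m1, m2 = sum(col), sum(row)
    n1 = (1 - b2 / (2 * (a2 + b2))) * (len(col) - m1)   # corrected failure count
    n2 = (1 - b1 / (2 * (a1 + b1))) * (len(row) - m2)   # assumed symmetric form

    p_new, q_new = dict(p), dict(q)
    p_new[i_p] = _clip(rng.betavariate(a1 + m1, b1 + n1))
    q_new[j_p] = _clip(rng.betavariate(a2 + m2, b2 + n2))

    def log_kernel(pv, qv):
        return (m1 * math.log(pv) + n1 * math.log(1 - pv)
                + m2 * math.log(qv) + n2 * math.log(1 - qv))

    log_a = (log_kernel(p[i_p], q[j_p]) + _log_lik(p_new, q_new, obs, i_p, j_p)
             - log_kernel(p_new[i_p], q_new[j_p]) - _log_lik(p, q, obs, i_p, j_p))
    if rng.random() < math.exp(min(0.0, log_a)):
        return p_new, q_new, True
    return p, q, False                              # reuse the last accepted sample
```

A chain is obtained by calling `mh_step` repeatedly with the same random generator, starting from any parameter values strictly between 0 and 1 for the observed rows and columns.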
Figure S3: This diagram presents the first 4 simulations of BAMCP in an MDP with 2 actions from 
state $\langle s_t, h_t \rangle$. The rollout trajectories are represented with dotted lines (green 
for the current rollouts, and greyed out for past rollouts). 1. The root node is expanded with two 
action nodes. Action $a_1$ is chosen at the root (random tie-breaking) and a rollout is executed in 
$\mathcal{P}^1$ with a resulting value estimate of 0. Counts $N(\langle s_t, h_t \rangle)$ and 
$N(\langle s_t, h_t \rangle, a_1)$, and value $Q(\langle s_t, h_t \rangle, a_1)$ get updated. 2. Action 
$a_2$ is chosen at the root and a rollout is executed with value estimate 0. Counts and value get 
updated. 3. Action $a_1$ is chosen (tie-breaking), then $s'$ is sampled from 
$\mathcal{P}^3(s_t, a_1, \cdot)$. State node $\langle s', h_t a_1 s' \rangle$ gets expanded and action 
$a_1$ is selected, incurring a reward of 2, followed by a rollout. 4. The UCB rule selects action 
$a_1$ at the top, the successor state $s'$ is sampled from $\mathcal{P}^4(s_t, a_1, \cdot)$. Action 
$a_2$ is chosen from the internal node $\langle s', h_t a_1 s' \rangle$, followed by a rollout using 
$\mathcal{P}^4$ and $\pi_{ro}$. A reward of 2 is obtained after 2 steps from that tree node. Counts 
for the traversed nodes are updated and the MC backup updates 
$Q(\langle s', h_t a_1 s' \rangle, a_2)$ to $0 + \gamma \cdot 0 + \gamma^2 \cdot 2 + \gamma^3 \cdot 0 + \cdots = 2\gamma^2$ 
and $Q(\langle s_t, h_t \rangle, a_1)$ to $\gamma + (2\gamma^3 - \gamma)/3 = \tfrac{2}{3}(\gamma + \gamma^3)$. 
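The backup arithmetic of step 4 in the caption can be traced with an incremental mean update. This is a small sketch under assumptions: all rewards other than the two rewards of 2 mentioned in the caption are taken to be zero, the discount value is arbitrary, and `backup` is a hypothetical helper rather than a name from the paper.

```python
def backup(q, n, ret):
    """Incremental Monte-Carlo mean: fold return `ret` into estimate `q` with count `n`."""
    n += 1
    return q + (ret - q) / n, n


gamma = 0.9  # arbitrary illustrative discount

# Q(<s_t, h_t>, a_1) over the simulations that select a_1 at the root.
q_root, n_root = 0.0, 0
q_root, n_root = backup(q_root, n_root, 0.0)             # simulation 1: rollout value 0
q_root, n_root = backup(q_root, n_root, gamma * 2.0)     # simulation 3: reward 2 one step below the root
q_after_sim3 = q_root

# Simulation 4: reward 2 arrives two steps after the internal node.
return_inner = 0.0 + gamma * 0.0 + gamma ** 2 * 2.0
q_inner, n_inner = backup(0.0, 0, return_inner)          # first visit of a_2 at the internal node
q_root, n_root = backup(q_root, n_root, gamma * return_inner)
```

After simulation 3 the root estimate equals $\gamma$, and after simulation 4 it equals $\tfrac{2}{3}(\gamma + \gamma^3)$, matching the closed form in the caption for any discount value.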
Figure S4: A portion of an infinite 2D grid task generated with Beta distribution parameters 
$\alpha_1 = 1, \beta_1 = 2$ (columns) and $\alpha_2 = 2, \beta_2 = 1$ (rows). Black squares at location 
$(i, j)$ indicate a reward of 1; the circles represent the corresponding parameters $p_i$ (blue) and 
$q_j$ (orange) for each row and column (area of the circle is proportional to the parameter value). 
One way to interpret these parameters is that following column $i$ implies a collection of 
$2p_i/3$ reward on average (2/3 is the mean of a Beta(2, 1) distribution) whereas following any row 
$j$ implies a collection of $q_j/3$ reward on average; but high values of parameters $p_i$ are less 
likely than high values of parameters $q_j$. These parameters are employed for the results presented 
in Figure 2. 





[Figure S5, panels (a) and (b); horizontal axis: planning time (s).] 



Figure S5: Performance of BAMCP on the infinite 2D grid task of Section 5.2, for $\gamma = 0.97$, as 
in Figure 2 but where the grids are generated with Beta parameters (a) $\alpha_1 = 0.5, \beta_1 = 0.5, 
\alpha_2 = 0.5, \beta_2 = 0.5$ and (b) $\alpha_1 = 0.5, \beta_1 = 0.5, \alpha_2 = 1, \beta_2 = 3$. In the 
wrong prior scenario (green dotted line), BAMCP is given the parameters (a) $\alpha_1 = 4, \beta_1 = 1, 
\alpha_2 = 0.5, \beta_2 = 0.5$ and (b) $\alpha_1 = 1, \beta_1 = 3, \alpha_2 = 0.5, \beta_2 = 0.5$. The 
behavior of the agent is qualitatively different depending on the prior parameters employed (see 
supplementary videos). For example, for the scenario in Figure 2, rewards are often found in 
relatively dense blocks on the map and the agent exploits this fact when exploring; for scenario 
(b) of this Figure, good reward rates can be obtained by following the rare rows that have high 
$q_j$ parameters, but finding good rows can be expensive, so the agent might settle on sub-optimal 
rows (as in bandit problems, where the Bayes-optimal agent might settle on a sub-optimal arm if it 
believes it is likely the best arm given past data). It should be pointed out that the actual 
Bayes-optimal strategy in this domain is not known; the behavior of BAMCP for finite planning time 
might not qualitatively match the Bayes-optimal strategy. 